Searching for Patterns of Thermostability in Proteins and Defining the Main Features Contributing to Enzyme Thermostability through Screening, Clustering, and Decision Tree Algorithms
Abstract
Finding or making thermostable enzymes is an important goal in a number of industries. Understanding the features involved in enzyme thermostability is therefore crucial, and different approaches have been used to extract or manufacture thermostable enzymes. Herein we examined features that contribute to the thermostability of 2,946 proteins. We used various screening techniques (anomaly detection, feature selection), clustering methods (K-Means, TwoStep cluster), decision tree models (Classification and Regression Tree, CHAID, Exhaustive CHAID, QUEST, C5.0), and generalized rule induction (GRI) association models to search for patterns of thermostability and to find features that contribute to enzyme thermal stability. Arg as the N-terminal amino acid was found solely in proteins working at temperatures higher than 70 °C. Fifty-four protein features were shown to be important in feature selection modeling, and the number of peer groups with an anomaly index of 2.12 declined from 7 to 2 when the model was run using only the selected important features; however, the number of groups did not change when K-Means and TwoStep clustering were performed on datasets with or without feature selection filtering. The depth of the trees generated by the decision tree models ranged from 4 branches (in CHAID models) to 14 branches (in the C5.0 model with 10-fold cross-validation and with feature selection of the dataset). Performance evaluation of the decision tree models showed that C5.0 was the best and QUEST was the worst. We did not find any significant difference in the percent of correctness, performance evaluation, and mean correctness of the decision tree models when feature-selected datasets were used, but the number of peer groups in clustering models was reduced significantly (p<0.05) compared to datasets without feature selection.
In all decision tree models, the frequency of Gln was the most important feature for the decision tree rule sets; moreover, in all 100 GRI association rules, the frequency of Gln appeared in the antecedent supporting the rule. The importance of Gln in protein thermostability is discussed in this paper.
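The two sequence-level features highlighted in the abstract, the frequency of Gln and the identity of the N-terminal residue, are simple to compute from a one-letter amino acid sequence. The following is a minimal sketch under that assumption; the function names and the returned keys are hypothetical and are not taken from the paper, which does not describe its feature-extraction code.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (Q = Gln, R = Arg).
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in STANDARD_AMINO_ACIDS}


def sequence_features(sequence):
    """Hypothetical feature record: length, N-terminal residue, Gln frequency."""
    freqs = amino_acid_frequencies(sequence)
    n_terminal = sequence[0].upper()
    return {
        "length": len(sequence),
        "n_terminal": n_terminal,
        "n_terminal_is_arg": n_terminal == "R",
        "freq_gln": freqs["Q"],
    }
```

Records like these, computed for every protein, could serve as input columns for the clustering and decision tree models named above; how the authors actually encoded their features is not stated in the abstract.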
Similar resources
Engineering Thermostable Enzymes; Application of Unsupervised Clustering Algorithms
There is a high demand for engineering thermostable enzymes in some industries, especially in the paper industry, where environmentally friendly enzymes can replace toxic chlorine chemicals. Hence, understanding protein attributes involved in enzyme thermostability is important. Herein, the most important protein features contributing to enzyme thermostability were searched by using data mining algor...
An expert system to predict protein thermostability using decision tree
Protein thermostability information is closely linked to commercial production of many biomaterials. Recent developments have shown that amino acid composition, special sequence patterns and hydrogen bonds, disulfide bonds, salt bridges and so on are of considerable importance to thermostability. In this study, we present a system to integrate these various factors that predict protein thermost...
Increasing Performance and Thermostability of D-Phenylglycine Aminotransferase in Miscible Organic Solvents
Background: D-Phenylglycine aminotransferase (D-PhgAT) is highly beneficial in pharmaceutical biotechnology. Like many other enzymes, D-PhgAT suffers from low stability under harsh processing conditions, poor solubility of substrate and products, and occasional microbial contamination. Incorporation of miscible organic solvents into the enzyme’s reaction is considered as a solution...
Improving the operational properties of the endoglucanase enzyme through amino acid changes
Background & Aims: Ethanol produced from plant cellulose is called bioethanol and is recognized as a unique sustainable liquid fuel with powerful economic and environmental effects. In the present study we aimed to integrate a cellulase gene into the yeast genome so that the enzyme is secreted out of the cell. Subsequently, cellulose is degraded to glucose by the enzyme, and then it is ferment ...
Signal processing approaches as novel tools for the clustering of N-acetyl-β-D-glucosaminidases
Nowadays, the clustering of proteins, and of enzymes in particular, is one of the most popular topics in bioinformatics. An increasing number of chitinase genes from different organisms and their sequences have been identified. So far, various mathematical algorithms for the clustering of chitinase genes have been used, but most of them seem to be confusing and sometimes insufficient. In the...